EXECUTIVE SUMMARY


Background 

Running  performance  often  is  used  to  evaluate  aerobic 
capacity.  A  previous  review  addressed  the  question  "What  is  the 
relationship  between  distance  and  validity?"  Bioenergetic  models 
of  running  performance  based  on  established  physiological 
principles  suggested  that  performance  on  longer  runs  would  always 
yield  more  valid  estimates  than  performance  on  shorter  runs.  The 
cumulative  evidence  from  122  studies  contradicted  this 
expectation. Validity increased with distance for shorter runs,
but was constant for distances ≥2 km. Also, validity was lower for
fixed-time runs of <12 min duration than for fixed-time runs of
≥12 min.

Objective 

This  report  was  undertaken  to  determine  whether  the  initial 
findings  could  be  replicated. 

Approach 

The  published  literature  was  searched  to  identify  studies 
of  cardiorespiratory  threshold  measures  (e.g.,  ventilatory 
threshold, anaerobic threshold) and running performance. A meta-analysis
was conducted on reported correlations between VO2max and
performance; 74 correlations were extracted from 39 studies. The
analyses  cross-validated  a  set  of  statistical  models  initially 
developed  and  tested  in  the  earlier  review. 

Results 

The  earlier  findings  replicated  well.  The  model  with 
increasing  validity  up  to  2  km  or  12  min  and  constant  validity 
from  those  criterion  points  onward  was  the  best  representation  of 
the  data.  An  earlier  finding  that  fixed-time  run  tests  (e.g.,  a 
12-min  run)  provided  better  estimates  of  aerobic  capacity  than 
fixed-distance run tests (e.g., 5 km) also replicated (r = .807
vs. r = .706).

Conclusions 

Run  tests  should  be  at  least  2  km  in  distance  or  12  min  in 
duration  to  maximize  validity  as  indicators  of  aerobic  capacity. 
Increasing  distance  or  time  beyond  these  minimum  values  does  not 
improve run test validity as an indicator of VO2max. Fixed-time
tests  have  higher  average  validity  than  fixed-distance  tests,  so 
a  12-min  run  test  will  maximize  validity  while  minimizing  demands 
on  the  runners. 
Running performance often is used to evaluate aerobic
capacity.  A  previous  review  indicated  that  the  validity  of 
running  performance  increased  with  distance  up  to  2  km;  validity 
was constant from 2 km onward (Vickers, 2001).¹

Vickers's  (2001)  review  was  undertaken  to  address  several 
questions.  Do  longer  runs  provide  better  estimates  of  aerobic 
capacity?  Can  the  relationship  between  distance  and  validity  be 
quantified?  What  is  the  shortest  test  that  yields  acceptable 
validity?  How  much  is  gained  by  increasing  the  test  length  beyond 
this  minimum?  The  expectation  was  that  the  first  two  questions 
would  be  answered  affirmatively  and  that  the  second  two  could  be 
answered  by  constructing  a  simple  mathematical  model  relating  run 
distance  to  validity. 

The  anticipated  answers  to  the  questions  posed  in  the 
initial  review  were  based  on  empirical  and  theoretical 
considerations.  Several  studies  have  shown  higher  validity  for 
longer  runs  (Burke,  1976;  Farrell,  Wilmore,  Coyle,  Billing,  & 
Costill,  1979;  Shaver,  1975;  Weyand,  Cureton,  Conley,  Sloniger,  & 
Liu, 1994). Mathematical models of the bioenergetics of running
provide  a  theoretical  explanation  for  this  trend  (Capelli,  1999; 
di  Prampero  et  al.,  1993;  Ward-Smith,  1999).  These  models  suggest 
that  validity  will  increase  indefinitely  with  distance.  However, 
the  rate  of  increase  will  be  slower  as  distance  increases. 

The  review  results  were  unexpected.  The  fact  that  validity 
only  increased  up  to  2  km  meant  that  there  was  a  range  of  run 
distances  for  which  the  relationship  was  not  strictly  increasing 
as  expected.  Instead,  the  relationship  would  be  characterized 
mathematically  as  nondecreasing.  This  observation  is  critically 
important  when  modeling  the  validity  of  run  tests.  No  model, 
whether  linear,  curvilinear,  or  nonlinear,  that  predicts  higher 
validity  coefficients  for  longer  tests  will  fit  the  data. 

A  piecewise  (PW)  model  was  formulated  to  represent  the 
data.  The  model  was  piecewise  because  separate  equations 
predicted validity for runs above and below the 2-km threshold.
For  runs  less  than  2  km,  the  predicted  validity  was  determined  by 
the  logarithm  of  the  distance.  For  runs  of  2  km  or  longer,  the 
prediction  was  a  constant.  Each  range  of  predictions  was  one 
piece  of  the  model. 


¹ Validity is the appropriateness of the interpretation of a test score
(American Psychological Association, 1985). Most tests can be
interpreted  more  than  one  way  and,  therefore,  have  more  than  one 
validity.  As  used  in  this  paper,  validity  refers  solely  to  the 
interpretation  of  run  test  performance  as  an  indicator  of  aerobic 
capacity.  In  this  context,  the  term  "validity  coefficient"  refers  to 
the  correlation  between  run  test  performance  and  maximal  oxygen  uptake 
capacity. 
The PW model answered the original questions. If the
minimum  acceptable  validity  for  a  run  test  is  r  =  .70,  2  km  is 
the  minimum  run  distance.  The  shortest  fixed-time  test  would  be 
12  min.  Adding  distance  or  time  to  these  minimum  values  does  not 
increase  test  validity. 

The  prior  review  has  two  major  implications.  First,  the 
results  defined  empirical  criteria  for  classifying  tests  as 
endurance  runs.  The  minimum  criteria  were  2  km  or  12  min.  Any  run 
meeting  either  criterion  is  an  endurance  test.  Although  the 
validity  of  fixed-distance  tests  and  fixed-time  tests  differed, 
the  data  indicated  validity  was  constant  within  each  category. 

Second,  the  evidence  provided  an  empirical  basis  for 
recommending  one  run  test  as  the  best  option  for  estimating 
aerobic  capacity.  Fixed-time  endurance  tests  produced  higher 
validity  coefficients  than  fixed-distance  endurance  tests.  The 
reason  for  the  difference  was  not  clear,  but  fixed-time  tests 
might  increase  the  likelihood  that  runners  will  adopt  the 
strategy  of  running  at  a  constant  pace  throughout  the  test.  This 
strategy yields optimal performance (Fukuba & Whipp, 1999).
Whatever  the  basis  for  the  difference,  the  best  test  for 
estimating  aerobic  capacity  would  be  the  shortest  fixed-time 
endurance  test.  This  test  will  yield  the  highest  validity  with 
the  least  effort  and  time  required  on  the  part  of  the  test 
takers.  The  test  also  is  valid  for  individuals  who  might  have 
trouble  running  longer  times  or  meeting  a  minimum  distance 
requirement (Sidney & Shepard, 1977). Considering these criteria,
the  best  run  test  would  be  a  12-min  timed  run. 

Neither  of  the  most  important  implications  of  the  prior 
findings  was  anticipated  when  the  review  was  undertaken. 
Unexpected  findings  should  be  viewed  with  skepticism  until  tested 
further.  Replication  is  a  constructive  response  to  skepticism. 
This  review,  therefore,  attempted  to  replicate  the  earlier  work. 
The  initial  data  set  was  extended  by  conducting  a  new  literature 
search  focused  on  physiological  threshold  variables  as  predictors 
of  running  performance  rather  than  maximal  aerobic  capacity.  The 
extended search identified 39 studies that reported 74 maximal
oxygen uptake (VO2max)-running performance correlations that were
not  included  in  the  initial  review.  These  data  were  used  to 
replicate  and  cross-validate  the  original  findings. 


Methods 


Literature  Search 

The  literature  search  had  three  primary  elements.  First, 
the  PubMed®  database  was  searched  using  "threshold"  and  "running" 
as the key words. The general term threshold was used in the hope
that it would identify articles that dealt with various
thresholds in the physiology literature. These included
ventilatory  threshold,  lactate  threshold,  and  anaerobic 
threshold. 

Articles  identified  in  the  PubMed  search  were  examined  to 
determine  whether  they  included  useful  data.  The  reference  lists 
for  articles  that  contained  at  least  one  useful  correlation  were 
examined to identify other studies that might report VO2max-running
performance correlations. The references identified in this
process were compared with the list of articles covered in the
review (Vickers, 2001). New articles were examined to see
whether  they  contained  results  that  could  be  used  in  this  review. 
This  step  comprised  the  ancestry  review  for  the  present  work. 

The  reference  catalog  at  San  Diego  State  University  was 
searched  to  identify  dissertations  and  theses  involving  running. 
The  list  was  compared  with  the  citations  in  Vickers  (2001)  to 
determine  which  work  had  been  examined  previously.  Those 
dissertations  and  theses  not  covered  in  the  earlier  review  were 
examined  to  see  whether  they  reported  either  correlations  that 
would  be  used  in  this  review  or  individual  data  that  could  be 
used  to  compute  correlations. 

The  literature  search  identified  39  studies  listed  in 
Appendix  A.  These  studies  reported  results  from  50  distinct 
samples.  The  samples  included  1,131  total  participants  who 
produced  1,769  running  performance  results.  The  outcome  was  a  set 
of 74 correlations, 56 from published sources, including books.
The other 18 correlations were from theses and dissertations. The
average  sample  size  for  the  74  correlations  was  n  =  23.9. 

Data  Extraction 

The  information  extracted  from  each  report  consisted  of  the 
sample  size,  the  type  of  run  test  (fixed-distance  or  fixed-time), 
the distance run, the average run time, and the VO2max-running
performance  correlation.  Performance  was  recorded  a  number  of 
different  ways  in  different  studies.  Performance  on  fixed- 
distance  tests  was  usually  recorded  as  a  run  time,  but  sometimes 
was  represented  by  average  running  velocity.  Performance  on 
fixed-time  tests  typically  was  recorded  as  distance,  but 
sometimes was reported as a predicted VO2max. VO2max predictions
usually were computed using equations that involved only run
distance. However, in some cases the predictions were based on
multivariate equations with other predictors, such as weight or
gender.

The  signs  of  correlations  with  run  time  as  the  performance 
criterion  were  reversed  so  that  correlations  would  have 
comparable  meaning  for  all  studies.  For  every  other  criterion, 
higher values indicated better performance. The correlations,
therefore, were nearly all positive. In contrast, when run time
was the criterion, lower scores indicated better performance and
nearly all correlations were negative. Reversing the signs for
these correlations meant that a positive correlation indicated
how strongly VO2max was related to good performance for all
studies.

A  separate  record  was  constructed  for  each  run  test  in  a 
study. Thus, a study that included 1,500-m, 5-km, and 10-km runs
produced  3  records,  one  for  each  distance.  Sample  attributes  were 
duplicated  on  each  record.  Each  record  was  treated  as  a  separate 
case  in  the  analysis.  This  decision  meant  that  the  cases  analyzed 
were  not  entirely  independent,  thereby  introducing  statistical 
complexities  for  significance  testing  (Becker  &  Schram,  1994; 
Steiger, 1980). The common meta-analytic practice of averaging
effect  sizes  to  produce  a  single  value  for  each  sample  was  not 
suitable  for  the  present  purposes.  Averaging  would  have  prevented 
meaningful  analysis  of  the  relationship  between  validity  and  test 
length. 
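The record-building and sign-reversal rules described above can be sketched in a few lines of Python. This is an illustration only: the `RunRecord` fields, the criterion labels, and the example numbers are hypothetical and are not taken from any study in the review.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One record per run test; sample attributes are repeated on each record."""
    study: str
    n: int               # sample size
    distance_m: float    # run distance in meters
    criterion: str       # hypothetical labels: "run_time", "velocity", "distance"
    r: float             # correlation after sign alignment


def align_sign(criterion: str, r: float) -> float:
    """Reverse the sign when run time is the criterion (lower time is better)."""
    return -r if criterion == "run_time" else r


def build_records(study: str, n: int, tests):
    """tests is a list of (distance_m, criterion, reported_r) tuples."""
    return [RunRecord(study, n, d, c, align_sign(c, r)) for d, c, r in tests]


# A hypothetical study with three run tests yields three records.
records = build_records(
    "Hypothetical study", 20,
    [(1500, "run_time", -0.62), (5000, "run_time", -0.74), (10000, "velocity", 0.71)],
)
```

After alignment, a positive correlation means the same thing on every record, whichever performance criterion the study reported.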

Analysis  Procedures 

As  Rosenthal  and  DiMatteo  (2001)  noted,  the  underlying 
logic  and  basic  computational  procedures  used  in  meta-analysis 
are  the  same  as  those  used  in  the  analyses  of  primary  data.  The 
basic  summary  statistics  are  weighted  average  correlations  and 
computations  of  variance  about  those  averages.  In  every  analysis, 
the  observed  correlations  are  compared  with  predicted  values 
based  on  the  model.  The  estimated  variance  for  the  model  provides 
a test of statistical significance.²

The  basic  analysis  followed  the  procedures  in  Chapter  11  of 
Hedges  and  Olkin  (1985).  Olkin  and  Pratt's  (1958)  formula  was 
used  to  correct  the  correlations  for  sample  size  bias.  Fisher's 
r-to-z  transformation  was  applied  to  normalize  the  distribution 
of the corrected correlations (Hays, 1963). The transformed
values are labeled zuF(i) to indicate that they represent the z value
of the unbiased Fisher-transformed correlation for the ith
sample. The zuF(i) were the dependent variables in analysis of
variance and regression procedures that weighted each observation
by (ni - 3), where ni is the sample size for the ith correlation.
Using  this  weighting,  the  sums  of  squares  reported  for  the 
analyses  are  values  that  can  be  used  to  test  hypotheses. 
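As a minimal sketch of the weighting scheme just described, the Python below transforms correlations with Fisher's r-to-z, weights each by (n - 3), and computes the weighted mean and the weighted sum of squares about that mean. It assumes the correlations passed in have already been corrected for sample size bias; the Olkin and Pratt correction itself is not reproduced here. Function names are illustrative.

```python
import math


def fisher_z(r: float) -> float:
    """Fisher's r-to-z transformation."""
    return math.atanh(r)


def weighted_summary(corrs):
    """corrs: list of (r, n) pairs, with r assumed already bias corrected.

    Returns (weighted mean z, weighted sum of squares about the mean,
    mean back-transformed to r).
    """
    z = [fisher_z(r) for r, _ in corrs]
    w = [n - 3 for _, n in corrs]
    mean_z = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    q = sum(wi * (zi - mean_z) ** 2 for wi, zi in zip(w, z))
    return mean_z, q, math.tanh(mean_z)
```

Because each z has sampling variance of approximately 1/(n - 3), the weighted sum of squares can be referred to a chi-square distribution with k - 1 degrees of freedom for k correlations.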

Three  models  were  evaluated  in  both  the  replication  and  the 
cross-validation:


²Significance tests based on the χ² values should be interpreted with
some  caution  given  that  not  all  of  the  validity  coefficients  were 
independent.  However,  only  a  small  proportion  of  the  total  observations 
involved  dependent  coefficients.  Note  also  that  significance  tests  were 
not  the  basis  for  choosing  the  final  model  from  the  analyses. 
A. The regression model used the logarithm of distance to predict
zuF(i) values. This model is referred to as the LogDist model to
indicate  the  predictor  that  was  used  in  the  regression. 

B.  The  test-by-test  (TxT)  model  predicted  the  average  value  for 
each  of  9  groups.  Seven  groups  represented  specific  distances 
represented  in  the  data  set  by  3  or  more  correlations  (1  mile, 
2  km,  1.5  mile,  3  mile,  5  km,  10  km,  marathon).  Miscellaneous 
short (<1,850 m; n = 7) and long (>1,850 m; n = 9) runs were
general  groups  that  included  all  correlations  for  distances 
represented  by  just  1  or  2  correlations  in  the  data  set.  This 
model  was  constructed  using  the  same  rules  as  the  TxT  model  in 
Vickers (2001). The specific groups included differ because of
differences in the data available for the analyses (see
Appendix B for original model).

C. The PW model developed by Vickers (2001) regressed zuF(i) on the
logarithm of distance for runs <2 km, then estimated a
constant value for runs ≥2 km.
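The LogDist and PW models can be sketched as weighted least-squares fits. The code below is an illustration under stated assumptions, not the SPSS procedure used in this review: the function names are hypothetical, the weights are taken to be (n - 3), and the PW constant is assumed to be the weighted mean of the transformed correlations for runs at or above the cut point.

```python
def weighted_mean(values, weights):
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def fit_logdist(log_d, z, w):
    """Weighted regression of z on log distance.

    Returns (slope, intercept, model sum of squares, residual sum of squares).
    """
    xbar = weighted_mean(log_d, w)
    ybar = weighted_mean(z, w)
    sxx = sum(wi * (x - xbar) ** 2 for x, wi in zip(log_d, w))
    sxy = sum(wi * (x - xbar) * (y - ybar) for x, y, wi in zip(log_d, z, w))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    total = sum(wi * (y - ybar) ** 2 for y, wi in zip(z, w))
    model = slope * sxy
    return slope, intercept, model, total - model


def fit_pw(dist_km, log_d, z, w, cut_km=2.0):
    """Piecewise fit: regression below the cut, constant at or above it.

    Returns (slope, intercept, constant, residual sum of squares).
    """
    short = [i for i, d in enumerate(dist_km) if d < cut_km]
    long_ = [i for i, d in enumerate(dist_km) if d >= cut_km]
    slope, intercept, _, resid_short = fit_logdist(
        [log_d[i] for i in short], [z[i] for i in short], [w[i] for i in short]
    )
    const = weighted_mean([z[i] for i in long_], [w[i] for i in long_])
    resid_long = sum(w[i] * (z[i] - const) ** 2 for i in long_)
    return slope, intercept, const, resid_short + resid_long
```

The TxT model corresponds to replacing both pieces with a separate weighted mean for each distance group, which is why it uses many more parameters.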

This  review  considered  only  these  3  models  because  they 
were  the  most  promising  of  a  larger  set  of  models  evaluated  in 
Vickers (2001). The TxT model provided the best overall fit to
the  data  in  the  initial  review.  This  model  minimizes  the  squared 
error  in  predictions  for  each  run  distance  represented  by  3  or 
more  correlations.  The  TxT  model,  therefore,  provides  explanatory 
power  that  approaches  the  maximum  possible  value  when  distance  is 
used  as  a  predictor  of  validity.  The  LogDist  model  did  not  fit 
the  data  as  well  as  either  the  PW  or  TxT  models.  However,  the 
predicted  values  in  this  model  increase  continuously  with 
distance.  The  rate  of  increase  per  unit  distance  decreases  as 
distance increases. These attributes are characteristic of
predictions from bioenergetic models. Thus, this model was
included  as  an  approximation  to  predictions  from  bioenergetic 
models  of  running  performance. 

Vickers  (2001)  adopted  the  PW  model  over  the  TxT  and 
LogDist  models  and  several  other  models  after  weighing  three 
criteria: explanatory power, number of parameters in the model
(i.e., parsimony, cf. Popper, 1959), and relationships to
physiological  constructs.  Considering  the  3  models  evaluated 
here,  the  regression  model  was  simple  and  clearly  linked  to 
existing  constructs  but  had  the  least  explanatory  power.  The  TxT 
model  had  the  most  explanatory  power,  but  this  model  required 
many  more  parameters  than  either  alternative  model.  Further,  the 
pattern  of  mean  differences  as  a  function  of  distance  was 
irregular  and  did  not  have  a  clear  relation  to  physiological 
processes.  The  PW  model  provided  intermediate  explanatory  power, 
but  it  combined  parametric  parsimony  with  a  reasonable 
explanation in terms of known physiological mechanisms. Constant
validity  for  endurance  tests  could  be  explained  by  concepts  such 
as  anaerobic  threshold  or  critical  power.  The  PW  model  also  had 
the pragmatic value that it corresponded well to a simple graphic
representation  of  the  data. 

The  3  models  retained  from  the  initial  review  were  compared 
in  analyses  that  first  replicated  the  original  model  selection 
process.  For  these  analyses,  parameter  values  for  the  models  were 
estimated  from  the  present  data.  The  fit  of  each  model  was 
evaluated  using  the  same  criteria  as  in  the  initial  review.  This 
replication  was  undertaken  to  explore  the  possibility  that  the 
relative  ordering  of  the  models  was  specific  to  the  initial  data 
set. 


The  replication  analysis  was  followed  by  cross-validation 
analyses.  In  these  analyses,  the  model  parameters  were  fixed  at 
the values estimated in Vickers (2001). The parameter values are
shown  in  Appendix  B.  The  cross-validation  represented  an 
important  shift  in  the  work  from  exploratory  analysis  to 
confirmatory  analysis. 

All analyses were conducted using SPSS-PC (SPSS, Inc.,
1998a, b). The weighted GLM and REGRESSION procedures were used.
When  correlations  are  appropriately  transformed  and  weighted  as 
previously  described,  the  results  include  sums  of  squares  that 
provide  appropriate  values  for  testing  meta-analytic 
hypotheses (Hedges & Olkin, 1985).

Parsimony-adjusted  goodness-of-fit  was  used  to  compare 
models.  The  Tucker  and  Lewis  (1973)  index  (TLI)  was  the  basic 
goodness-of-fit indicator (cf. Arbuckle & Wothke, 1999; Bentler
& Bonett, 1980; or Bollen, 1989, for discussion of goodness-of-fit
indices). The TLI indicates what proportion of the greater
than  chance  variation  in  correlations  is  accounted  for  by  a 
model.  Mulaik  et  al.'s  (1989)  parsimony  adjustment  was  applied  to 
the  TLI  to  allow  for  the  fact  that  more  complex  models  almost 
always  provide  a  better  absolute  fit  to  the  data  than  simpler 
models. The final model criterion, therefore, was the parsimony-adjusted
Tucker-Lewis Index (PTLI).
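The two indices can be illustrated with a short Python sketch. It uses the usual Tucker-Lewis form, which compares χ²/df for the null and fitted models, and the parsimony ratio df(model)/df(null). The null-model df of 67 used in the example is an assumption made here because it reproduces the PTLI values reported for the PW model (.157 in replication, .184 in cross-validation); the report does not spell out the computation.

```python
def tli(null_chi2: float, null_df: float, model_chi2: float, model_df: float) -> float:
    """Tucker-Lewis index from null and fitted-model chi-square/df ratios."""
    null_ratio = null_chi2 / null_df
    model_ratio = model_chi2 / model_df
    return (null_ratio - model_ratio) / (null_ratio - 1.0)


def ptli(null_chi2: float, null_df: float, model_chi2: float, model_df: float) -> float:
    """Parsimony-adjusted TLI: TLI multiplied by df(model) / df(null)."""
    return (model_df / null_df) * tli(null_chi2, null_df, model_chi2, model_df)


# PW model, replication fit: residual chi-square 107.53, two parameters estimated.
pw_replication = ptli(119.30, 67, 107.53, 65)

# PW model, cross-validation fit: residual chi-square 109.70, no parameters estimated.
pw_cross_validation = ptli(119.30, 67, 109.70, 67)
```

The index becomes negative whenever the model χ²/df exceeds the null χ²/df, which is how a fixed-parameter model can score below zero on new data.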

Hoelter's  (1983)  critical  N  was  used  to  guard  against 
assigning  undue  importance  to  small  effects.  Even  trivial  effects 
can  be  statistically  significant  given  a  large  enough  sample  size 
(Rosenthal & Rosnow, 1984). Hoelter (1983) proposed that the
potential  for  misleading  significance  tests  be  reduced  by 
determining  the  smallest  sample  size  for  which  an  observed 
difference  would  be  statistically  significant.  Hoelter  (1983) 
labeled  this  sample  size  the  critical  N  and  suggested  that  any 
effect  with  critical  N  >  200  was  too  small  to  be  theoretically  or 
practically important. This frame of reference was used
when evaluating the present findings.
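The report does not give the computation behind its critical N values. One reading, offered here as an assumption, is that a difference d between two Fisher z values is tested as χ² = (N - 3)d² with 1 df, so the critical N is the smallest N at which this statistic reaches the p = .05 critical value. That reading reproduces several critical N values reported in this review (for example, d = .062 gives 1,003 and d = .027 gives 5,273).

```python
import math

# Chi-square critical value for 1 df at p = .05.
CHI2_CRIT_05 = 3.841459


def critical_n(z_difference: float, chi2_crit: float = CHI2_CRIT_05) -> int:
    """Smallest N for which (N - 3) * d**2 reaches the critical chi-square.

    Assumed form, inferred from the reported values rather than stated
    in the report.
    """
    return math.ceil(chi2_crit / z_difference ** 2 + 3)
```

Under this reading, any discrepancy smaller than about .14 in z units has a critical N above Hoelter's threshold of 200.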
Table 1. Fit of Models Estimated From the Current Data

Model      df    Model χ²    Residual χ²    TLI     PTLI
LogDist     1      2.30        116.99       .010    .010
TxT         8     22.06         97.24       .163    .144
PW          2     11.76        107.53       .162    .157

Note. See text for model definitions. df = degrees of freedom for
the model. Each model is based on 67 correlations (66 df maximum)
with a total χ² of 119.30.


Results 

Figure 1 plots zuF(i) as a function of the logarithm of
distance for Vickers's (2001) data [Figure 1(a)] and the present
data [Figure 1(b)]. The figure includes LOESS plots (Cleveland,
1979) for the data. The most important aspect of Figure 1 is that
both LOESS plots are flat for distances ≥2 km. The horizontal
reference line is Vickers's (2001) PW prediction for endurance
runs (i.e., zuF = .9026). The flat portion of the LOESS curve
for the present data is slightly lower than, but approximately
parallel to, this reference line.

Figure 1 also shows increasing zuF(i) for runs <2 km in both
data sets. This trend is poorly defined for the present data.
Only a few data points are available for short runs. The
available  data  points  are  largely  restricted  to  the  range  of 
1,500  m  to  1,609  m.  The  two  curves  are  similar  for  the  data  that 
are  present. 

Model  Replication 

Fitting  the  models  to  the  data  replicated  the  earlier 
findings (Table 1). The LogDist model again had the least
predictive power. The variation explained by the PW model was
statistically significant (χ² = 11.76, 2 df, p < .003), but the
TxT model explained more (χ² = 22.06, 8 df, p < .005).

Some findings from the earlier review did not replicate.
The LogDist model was statistically significant in the prior
work, but not in these data (χ² = 2.30, 1 df, p > .129). The TxT
model was significantly better than the PW model in the prior
review, but not in these analyses (Δχ² = 10.30, 6 df, p > .112).
Finally, the PW PTLI previously had been smaller than the TxT
PTLI (.363 vs. .382), but was larger (.157 vs. .144) in these
analyses.
Table 2. Cross-Validation Statistics for Models

Model      Residual χ²    Difference χ²    PTLI
LogDist      125.38            8.38        <.000
TxT          122.38           25.15        <.000
PW           109.70            2.168        .184

Note. All models have 67 df because no parameters were estimated.
See text for definition of table entries.


Individual elements of the PW model replicated well.
Shorter runs (i.e., <2 km) had lower average validity than longer
runs (χ² = 9.09, 1 df, p < .001). The lack of association between
distance and validity for longer runs (r = -.041, χ² = 0.172, 1
df, p > .678) replicated a second PW model element.

The  only  PW  model  component  that  did  not  replicate  clearly 
was  the  significant  relationship  between  distance  and  validity 
for short runs (χ² = 2.68, 1 df, p > .101, for the present data).
However, the same trend was present and approached significance
(p < .051) using a one-tailed test to allow for the fact that the
direction of the relationship was known. Note also that there
were only a few short tests (n = 13), most of which represented a
narrow range of distance (9 of 13 either 1,500 m or 1,609 m).
Taken  in  context,  this  element  of  the  PW  model  replicated 
reasonably  well  within  the  constraints  of  the  data. 

Cross-Validation

Table  2  summarizes  the  results  of  the  cross-validation 
analyses.  The  residual  values  reported  in  the  table  are  the 
result  of  fitting  the  corresponding  Vickers  (2001)  model  to  the 
present data. The difference χ² is the difference between the
cross-validation fit and the replication fit of the model (see
Table 1). PTLI was computed for each model with the null model χ²
= 119.30. This figure was the overall χ² for the set of validity
coefficients. Model χ² values can be greater than this reference
point  because  applying  the  parameter  estimates  from  the  earlier 
review  to  the  present  data  can  produce  differences  between 
predicted  and  observed  correlations  that  are  larger  than  the 
differences  between  the  observed  values  and  the  mean  correlation 
for  the  present  data.  The  PTLI  is  negative  when  this  outcome  is 
obtained. 

Cross-validation  analyses  clearly  supported  the  PW  model. 
First, the fit of the PW model (χ² = 109.70) was 12.5% better
than the LogDist model (χ² = 125.38) and 10.4% better than the
TxT model (χ² = 122.38).
Robustness of the model parameters was another indication
of how well the PW model cross-validated. The difference in fit
between replication and cross-validation was not significant for
the PW model (χ² = 2.17, 3 df, p > .538). This result indicates
that the original parameter values for the model were very close
to the estimated values in the replication analysis. Sample-specific
parameters were significantly better than the replicated
values for the LogDist (χ² = 8.38, 3 df, p < .039) and TxT (χ² =
28.68, 14 df, p < .012) models.³

The  goodness-of-fit  index  was  the  third  reason  to  prefer 
the  PW  model  to  the  alternatives.  The  PW  PTLI  was  positive;  the 
LogDist  and  TxT  PTLI  were  negative.  The  negative  PTLI  indicated 
that  bias  in  the  cross-validation  estimates  (i.e.,  the  tendency 
for  predictions  to  err  consistently  in  the  same  direction)  was 
sufficient  to  offset  whatever  predictive  power  the  models  had  for 
the  new  data. 

The  goodness-of-fit  statistics  produced  another  indication 
that  the  PW  model  was  robust.  The  cross-validation  PTLI  was 
larger  than  the  replication  PTLI  (.184  vs.  .157).  The  reversal 
occurred  because  the  lost  degrees  of  freedom  associated  with 
estimating  sample-specific  parameters  in  the  replication  model 
more  than  offset  the  statistically  nonsignificant  gain  in 
predictive accuracy.⁴

Detailed  Cross-Validation  of  PW  Model 

The cross-validated PW model fit all of the data reasonably
well. The most important element of the model was the prediction
that zuF = .9026 for runs ≥2 km because these runs comprised
5/6ths of the data. The weighted average in the present data was
zuF = .8781. The critical N for the difference is 6,403. After
back-transforming the z values, the difference in the estimated
validity coefficients was r = .7176 versus r = .7055.

³The TxT cross-validation was based on predictions for 11 of 24
groups in the earlier review. The estimates from the earlier
model were applied to all tests that fell in 1 of the 24 distance
categories in that model. Thus, some tests classified as
miscellaneous in the present review had distance-specific
predictions.

⁴Negative PTLI values were obtained when cross-validation χ²
> baseline χ². Biased cross-validation predictions produced this
outcome. Bias is a consistent tendency toward underestimation or
overestimation. The average bias (weighted by n - 3) was +.059
for the TxT model, +.037 for the LogDist model, and +.027 for the
PW model. The biases were small [critical N (p < .05) = 1,107,
2,810, and 5,273 for the TxT, LogDist, and PW models,
respectively] and differed trivially. The critical N for the
largest difference was 7,506. Bias added 10.70 to the PW χ²,
11.90 to the TxT χ², and 16.35 to the LogDist χ².

Predictions  for  short  runs  were  less  important  for  the 
overall  fit  of  the  model  because  there  were  fewer  short  runs. 
Here,  too,  the  predictions  were  accurate.  The  800-m  prediction  (n 
=  1  correlation)  was  slightly  high  (+.011).  The  1-km  (n  =  2) 
prediction  was  slightly  low  (-.017).  The  1-mile  (n  =  4) 
prediction  was  very  close  to  the  observed  value  (+.002).  The 
largest  discrepancy  between  observed  and  predicted  value  was 
+.062 for the 1,500-m run (n = 2). The critical N for the 1,500-m
difference  was  1,003.  The  critical  N  would  be  substantially 
higher  for  each  other  distance  because  of  the  smaller 
discrepancies.  The  overall  model  fit,  therefore,  reflected  good 
fit  at  each  cross-validated  point. 

Fixed  Versus  Random  Effects 

The  replication  analyses  strengthened  the  choice  of  the  PW 
model.  Therefore,  fixed-  and  random-effects  versions  of  this 
model were compared. The fixed-effects model (χ² = 109.70) had
slightly better predictive accuracy than the random-effects model
(χ² = 115.18) when cross-validated.

Best  Estimate  Model 

The  prior  analyses  indicated  that  the  fixed-effects  PW 
model  was  the  best  representation  of  the  data.  The  present  data 
were  combined  with  those  from  Vickers  (2001)  to  estimate  the 
parameters  of  that  model  using  all  of  the  data.  The  resulting  PW 
model was:


If distance <2 km, z' = (0.225*L) - .0615
If distance ≥2 km, z' = .8960

where z' is the Fisher transformation of the unbiased correlation
coefficient and L is the logarithm of distance. Pooling the data
left the slope of the regression for short runs unchanged at
0.225. Pooling reduced the regression intercept slightly from the
earlier value of -.0036 to -.0615. Pooling reduced the estimated
value for long runs from .9026 to .8960 (critical N = 88,195).
After back-transformation, the revised estimate of the validity
coefficient was r = .714 compared with r = .718 in the initial
review. The estimated validity coefficient applied equally well
to all distances as indicated by the fact that distance and zuF(i)
were independent (r = -.009) from 2 km upward.
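The pooled PW model and the back-transformation from z' to r can be written as a short Python sketch. The function names are illustrative. Because this section does not state the base of the logarithm or the distance unit behind L, the sketch takes L as an argument rather than computing it.

```python
import math


def pw_predicted_z(distance_km: float, log_distance: float) -> float:
    """Pooled PW prediction of z'.

    log_distance is L from the model equation; the caller supplies it
    because the log base and distance unit are not given here.
    """
    if distance_km < 2.0:
        return 0.225 * log_distance - 0.0615
    return 0.8960


def z_to_r(z: float) -> float:
    """Back-transform a Fisher z value to a correlation."""
    return math.tanh(z)


# Endurance runs (2 km or longer) share one predicted validity coefficient.
r_pooled = z_to_r(pw_predicted_z(5.0, 0.0))   # L is ignored at or above 2 km
r_initial = z_to_r(0.9026)                    # value from the initial review
```

Rounded to three places, the two back-transformed values are .714 and .718, matching the coefficients quoted above.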
Fixed-Time Tests

The data included 7 fixed-time correlations. Five of these
correlations were for runs ≥12 min. The average correlation for
those 5 tests was r = .807. This value was significantly (χ² =
6.42, 1 df, p < .012) larger than the average for fixed-distance
tests in this review (r = .706). Both values were very similar to
the  estimates  derived  in  the  prior  review  (fixed-time,  r  =  .793; 
fixed-distance,  r  =  .718).  Critical  Ns  for  the  differences 
between  the  two  reviews  were  N  >  2,542  for  fixed-time  tests  and  N 
>  6,508  for  fixed-distance  tests.  The  pooled  significance  for  the 
difference  between  the  fixed-time  and  fixed-distance  correlations 
was  p  <  .0001  by  the  method  of  adding  probabilities  (Rosenthal, 
1978).  The  pooled  average  for  fixed-time  tests  was  r  =  .798. 
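
The significance level reported for the comparison in this review
can be checked with a short calculation. The sketch below assumes
the statistic is a chi-square with 1 degree of freedom, as the text
indicates; for 1 df the upper-tail probability reduces to
erfc(sqrt(chi-square / 2)). The function name is illustrative.

```python
import math


def chi2_sf_1df(chi2: float) -> float:
    """Upper-tail probability of a chi-square statistic with 1 df.

    For 1 df, chi-square is the square of a standard normal deviate,
    so P(X > chi2) = erfc(sqrt(chi2 / 2)).
    """
    if chi2 < 0:
        raise ValueError("chi-square must be non-negative")
    return math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    p = chi2_sf_1df(6.42)
    print(f"p = {p:.4f}")  # falls just below the reported bound of .012
```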


Discussion 

This  extension  of  Vickers's  (2001)  review  strongly 
supported  the  PW  model  relating  run  distance  to  validity.  The 
LOESS  plot  of  validity  as  a  function  of  distance  provides  the 
most  direct  indication  of  support  for  the  model.  This  graphic 
representation  showed  increasing  validity  for  short  runs  and 
constant validity coefficients for long (i.e., ≥2 km) runs. These
two  trends  are  the  essential  elements  of  the  PW  model.  Although 
this  replication  included  relatively  few  short  runs,  the  LOESS 
lines  clearly  were  very  similar. 

Replication  provided  formal  quantitative  support  for  the 
original  model  selection  process.  The  PW  model  once  again  had 
less  predictive  power  than  the  TxT  model  and  more  predictive 
power  than  the  regression  model.  Beyond  this  basic  similarity, 
however,  there  were  differences  between  the  initial  review  and 
the  present  replication.  Where  Vickers  (2001)  found  that  the  TxT 
model  had  significantly  greater  predictive  accuracy  than  the  PW 
model,  the  difference  was  not  significant  in  this  replication. 
Where  Vickers  (2001)  found  that  the  PTLI  for  the  TxT  model  was 
slightly  larger  than  the  PTLI  for  the  PW  model,  the  replication 
reversed this ordering. Thus, 2 of 3 criteria that gave reason to
consider  choosing  the  TxT  model  over  the  PW  model  in  the  initial 
review  were  reversed  in  this  replication.  The  only  remaining 
criterion  favoring  the  TxT  model  was  the  better  absolute  fit  of 
the  model  to  the  data.  The  PTLI  comparisons  indicate  that 
absolute  fit  is  a  weak  criterion,  given  the  substantial 
difference  in  complexity  between  the  PW  and  TxT  models.  Applying 
the  same  criteria  used  in  the  earlier  review,  the  results  of  this 
replication would lead to the adoption of a PW model.

Cross-validation  underscored  the  replication  trends.  The 
explanatory  power  of  the  PW  model  was  more  than  10%  greater  than 
either  competing  model.  The  fact  that  the  cross-validation  fit  of 
the PW model was nearly as good as the fit in the replication
analyses indicated that the model parameters were robust. In
fact,  the  difference  in  fit  was  not  statistically  significant, 
and  the  cross-validation  PTLI  for  the  PW  model  was  larger  than 
the  simple  replication  PTLI  for  this  model.  The  failure  of  the 
other  two  models  to  cross-validate  is  underscored  by  the  fact 
that  the  PTLI  was  negative  for  both  competing  models. 

These  findings  provide  strong  empirical  support  for  the  PW 
model.  The  cross-validation  results  are  particularly  noteworthy. 
These  analyses  provided  a  very  strong  test  of  the  original 
models.  The  cross-validation  test  of  a  model  was  a  stringent 
criterion  because  the  parameters  were  fixed  at  specific  values 
derived  in  the  earlier  review.  This  aspect  of  cross-validation 
analyses  meant  that  a  specific  value  was  predicted  for  each 
validity  coefficient  in  the  cross-validation  analyses.  These 
point  predictions  increased  the  risk  of  failure  for  the  model. 
Observed  values  had  to  be  close  to  the  specific  predicted  values 
to  avoid  a  significant  misfit  between  the  data  and  the  model. 

This  requirement  contrasts  with  null  hypothesis  testing 
procedures  that  treat  any  observed  value  that  is  significantly 
different  from  zero  as  support  for  a  model.  Cross-validation 
requires  that  the  typical  finding  lie  within  the  95%  confidence 
interval  around  the  predicted  value  given  the  sample  size.  This 
confidence  interval  will  be  narrower  than  the  range  of  all  values 
that  differ  significantly  from  zero.  The  greater  constraint  on 
the  range  of  data  that  indicate  acceptable  fit  of  the  model  makes 
the  confirmatory  cross-validation  more  likely  to  fail  than  the 
exploratory  test  of  a  null  hypothesis  model.  In  this  sense,  the 
cross-validation was a stronger test of the models (Meehl, 1990).
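
The logic of testing fixed point predictions can be sketched as
follows. This is not the report's actual computation: it assumes a
common meta-analytic misfit statistic, the sum of
(n - 3) * (z_obs - z_pred)^2 over correlations, where 1/(n - 3) is the
large-sample variance of Fisher's z. The data in the usage example
are invented for illustration.

```python
import math
from typing import Iterable, Tuple


def fisher_z(r: float) -> float:
    """Fisher transformation of a correlation coefficient."""
    return math.atanh(r)


def misfit_chi_square(
    observations: Iterable[Tuple[float, int, float]],
) -> float:
    """Sum of weighted squared deviations from fixed point predictions.

    Each observation is (observed r, sample size n, predicted z').
    Assumption: each z has sampling variance 1 / (n - 3), so the
    weight is (n - 3). No parameters are re-estimated from the data.
    """
    total = 0.0
    for r_obs, n, z_pred in observations:
        if n <= 3:
            raise ValueError("sample size must exceed 3")
        total += (n - 3) * (fisher_z(r_obs) - z_pred) ** 2
    return total


if __name__ == "__main__":
    # Hypothetical correlations, each tested against a fixed prediction
    # of z' = .903 (the earlier long-run value listed in Appendix B).
    data = [(0.70, 40, 0.903), (0.75, 55, 0.903), (0.68, 30, 0.903)]
    print(f"chi-square = {misfit_chi_square(data):.2f} on {len(data)} df")
```

Because every prediction is fixed in advance, each observation adds
to the misfit whenever it departs from its predicted value; nothing
in the model can adjust to absorb the discrepancy.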

The risks associated with cross-validation were clearly
evident in the results of these analyses. Two of the 3 models
cross-validated so poorly that they had negative PTLI values.
Only the PW model produced a positive PTLI. The fact that the fit
of  the  cross-validated  PW  model  was  not  significantly  different 
from  the  fit  of  the  PW  model  with  sample-optimized  parameter 
values  further  strengthened  this  model.  This  close  fit  between 
the  data  and  specific  point  predictions  for  each  observation  in 
the  data  set  is  what  Meehl  (1990)  refers  to  as  "a  darned  strange 
coincidence." Such coincidence should strengthen faith in the
model.

Several  characteristics  of  the  PW  model  could  account  for 
its  cross-validation  success.  Parsimonious  models  provide  more 
precise parameter estimates (Bentler & Mooijaart, 1989). Precise
estimates  should  increase  accuracy  when  applied  to  new  data 
because  they  are  less  likely  to  be  substantially  different  than 
the  population  parameters  that  are  being  estimated.  A  second 
point  to  consider  is  the  fact  that  parsimonious  models  have  less 
opportunity  to  capitalize  on  chance.  Fewer  parameters  are  fitted, 
so  it  is  less  likely  that  unnecessary  parameters  will  be  included 
by chance. The addition of parameters that represent chance
trends in the initial data will increase error when applied to
new data that do not include those chance trends. Having fewer
parameters also reduces the likelihood that chance observations
will lead to serious errors in the estimation of parameters that
truly belong in the model.

One  characteristic  of  the  PW  model  that  may  have  accounted 
for  its  cross-validation  success  is  particularly  noteworthy.  The 
PW  model  was  based  on  fitting  mathematical  functions  to  the  data. 
This  statement  was  true  even  for  runs  of  2  km  or  longer.  In  those 
cases,  the  function  was  a  constant,  but  that  constant  was 
essentially  the  intercept  in  a  regression  analysis  with  a  zero 
slope.  As  a  result,  the  prediction  for  any  given  distance  is 
influenced  by  the  pattern  of  data  for  other  run  distances.  This 
dependency  on  the  overall  pattern  of  data  means  the  model 
"borrows  strength"  from  other  evidence  in  the  data  when 
estimating  the  value  at  a  given  spot  (National  Research  Council, 
1992). The borrowing effect should help correct errors that arise
when  just  a  few  data  points  are  available  to  estimate  the 
correlation  for  a  given  run  distance.  In  such  cases,  a  single 
data  point  that  was  seriously  in  error  could  significantly  bias 
the  estimate  for  that  distance.  Fitting  a  function  to  the  data 
yields  estimates  that  smooth  the  curve  by  making  the  estimate 
consistent  with  nearby  values  rather  than  relying  just  on  the 
data  for  that  specific  distance. 

This review also replicated the difference between fixed-time
and fixed-distance tests. Fixed-time endurance runs were
more valid than fixed-distance endurance runs. The estimated
validity for each type of test was very similar to that obtained
in the prior review. On the whole, a fixed-time endurance test
increases validity by .084 relative to a fixed-distance endurance
test (fixed-time, r = .798; fixed-distance, r = .714). The
absolute difference is modest, but simple magnitude comparisons
can be misleading. For example, if a run test were to be used to
decide who meets a pass-fail criterion (e.g., 50th percentile of
a distribution), the fixed-time test would classify 8.4% more
people correctly (Rosenthal & Rubin, 1978).

The  replication  and  cross-validation  analyses  reinforce  the 
surprising  answer  to  several  questions  addressed  in  Vickers's 
(2001)  review.  A  12-min  run  is  the  best  option  for  estimating 
aerobic capacity. The standard error of estimate (SEE) for
aerobic capacity using this test is ~3.8 ml/kg/min.† Laboratory
VO2max test precision is ~3.0 ml/kg/min when the same protocol is
repeated twice or more (Froehlicher et al., 1974; Katch, Sady, &
Freedson, 1982; Safrit, Hooper, Ehlert, Costa, & Patterson,
1988). Estimates from a 12-min run will have an SEE ~25% greater
than the reference standard. The increase is not trivial, but it
may be acceptable in many situations.*

† SEE = SD * √(1 - r²), where SD = 6.24, the weighted average SD for
all samples in the two reviews, and r = .798, the average correlation
between the 12-min run and the VO2max assessments.
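
The SEE figures quoted above follow from the footnoted formula,
SEE = SD * sqrt(1 - r^2). A minimal Python sketch of the arithmetic,
using the values given in the text (the function name is
illustrative):

```python
import math


def standard_error_of_estimate(sd: float, r: float) -> float:
    """SEE = SD * sqrt(1 - r^2)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    return sd * math.sqrt(1.0 - r * r)


if __name__ == "__main__":
    see_run = standard_error_of_estimate(sd=6.24, r=0.798)
    lab_precision = 3.0  # ml/kg/min, repeated laboratory VO2max tests
    print(f"SEE for 12-min run: {see_run:.2f} ml/kg/min")
    print(f"Relative to laboratory precision: {see_run / lab_precision:.2f}x")
```

The result, about 3.76 ml/kg/min, rounds to the ~3.8 quoted in the
text and is roughly 25% above the 3.0 ml/kg/min laboratory figure.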

This review suggests that one factor that might bias
meta-analytic findings is unimportant in the present research domain.
Increasing  the  scope  of  the  literature  review  had  little  effect 
on  the  estimated  validity  coefficients.  It  is  very  unlikely  that 
even  the  347  correlations  examined  in  the  combined  reviews 
exhaust  the  literature  in  this  area.  However,  the  fact  that  two 
different  search  strategies  produced  very  similar  results  makes 
it  less  likely  that  omitted  studies  would  change  the  results 
substantially.  The  reviews  produced  similar  estimates  despite 
differences  in  the  proportional  representation  of  published  and 
unpublished studies. This outcome suggests that publication bias
is  not  a  major  factor  in  this  domain. 

The  preceding  conclusion  is  subject  to  one  critically 
important  qualification.  The  inference  is  based  on  trends 
averaged  across  many  types  of  people.  The  samples  included  males 
and  females,  young  and  old,  and  athletes  and  untrained 
individuals.  Previous  reviewers  have  cautioned  that  findings  may 
not  generalize  across  populations  (Baumgartner  &  Jackson,  1982; 
Safrit  et  al.,  1988).  These  cautions  are  still  relevant.  Figure  1 
clearly shows that correlations vary widely for runs ≥1 km.
Population  differences  may  be  one  source  of  this  variation. 

The  sources  of  variation  in  the  validity  of  endurance  run 
tests  will  be  the  topic  of  a  companion  review  (Vickers,  in 
preparation).  This  replication  of  Vickers's  (2001)  earlier 
findings  sets  the  stage  for  a  meaningful  assessment  of  this  topic 
by providing empirical criteria defining endurance runs. Runs ≥2
km or ≥12 min share a common validity within test type. The run
test  categories  thus  defined  can  both  be  classified  as  endurance 
runs.  With  this  point  established,  analysis  of  the  variation  in 
validity  coefficients  for  endurance  runs  can  determine  whether 
validity  generalizes  for  endurance  runs. 


* The standard deviation specified for VO2max tests applies to repetitions
of  a  single  protocol.  Differences  between  protocols  would  be  a  more 
appropriate  frame  of  reference.  The  reference  SEE  for  that  comparison 
would  be  larger  because  the  SEE  would  include  variance  attributable  to 
protocol  differences. 

Appendix B

Cross-Validation  Models 


Logarithm  of  Distance  (LogDist)  Model: 


z' = 0.240*L - .013


Test-by-Test (TxT) Model:

Distance (m):  800, 1000, 1500, 1609, 2000, 2414, 3200, 4827, 5000,
               10000, 21100, 42200, Misc Short, Misc Long

z':            .500, 1.024, .744, .732, 1.017, 1.075, .832, .806, .932,
               .715, .880, 1.015


Piecewise (PW) Model:

If (distance < 2000 m) z' = 0.225*L - .036
If (distance ≥ 2000 m) z' = .903


Note.  "L"  indicates  the  logarithm  of  distance  in  meters. 
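
For readers who wish to apply these fixed-parameter models, the
LogDist and PW equations can be written as small functions. This
sketch assumes that "logarithm" means base 10, which the note does
not state; the function names are hypothetical. The TxT model is
omitted because it is a lookup table rather than an equation.

```python
import math


def logdist_z(distance_m: float) -> float:
    """LogDist model: z' = 0.240*L - .013 (L assumed to be log10 meters)."""
    return 0.240 * math.log10(distance_m) - 0.013


def piecewise_z(distance_m: float) -> float:
    """PW model: rising below 2,000 m, constant at .903 from 2,000 m on."""
    if distance_m < 2000:
        return 0.225 * math.log10(distance_m) - 0.036
    return 0.903


if __name__ == "__main__":
    # The two models diverge at long distances: LogDist keeps rising,
    # while the PW prediction stays at its plateau.
    for d in (800, 1609, 5000, 42200):
        print(f"{d:>6} m  LogDist z' = {logdist_z(d):.3f}  "
              f"PW z' = {piecewise_z(d):.3f}")
```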